Introduction

With 162 games played for 30 teams a year, anything can happen. Literally anything. Baseball is also interesting, especially for data scientists, because more than any other sport, much of the sport is based of off statistics. Don’t get it wrong, other sports also rely on similar statistics to develop the right stratagy. Doug Pederson uses statistics throughout every facet of how he runs the Philadelphia Eagles, especially on fourth downs, and it paid of big, leading to the Eagles first Superbowl win and led to an epic celebration.

Statistics, while important beforehand, got thrust to the forefront during the Moneyball Era. For a brief moment in time, small market franchises, teams with less money to spend on recruiting players, such as the Oakland Athletics were able to gain a slight edge over big market teams, teams with a lot of money to spend recruiting players. Now that big market teams have caught on, how does a small market team, like the Houston Astros, beat a big market team, like the Los Angeles Dodgers in game 7 of the World Series? The answer: youth.

Young players are inexpensive, but a bit risky. They could turn out to be the next Mike Trout, or the next Mark Appel. Most importantly, they have six years of team control when they first hit the major leagues, meaning that only the team the player is on can sign them, unless, of course, they don’t want to. This has lead teams to trading their older, better players for young talent, rebuilding their farm system with potential stars. When those stars hit The Show and a team thinks they have the talent to make a World Series push, they will trade for other star players, often from teams looking to rebuild, to fill the gaps in talent that they have, sacrificing some of the youth in their minor leagues.

This study will follow five teams that have followed this model more or less to see how effective this stratagy is. The study will show how R can be used to easily manipulate data, removing less important stats and adding in more important ones, graph data, and make theories. The teams that this study will follow are the Chicago Cubs, the Houston Astros, the Philadelphia Phillies, the New York Yankees, and the Atlanta Braves. The Cubs and Astros were chosen as both ended long championship droughts with a young team. The Phillies went on to win the World Series in 2008, but crashed in 2011 and have been rebuilding since. The Braves looked poised to start making a run in the early 2010s, but crashed hard. Both the Phillies and Braves are coming out of the rebuilding and so far in 2018, both have looked successful. Yankees have always seemed good, but went through a brief rebuild in 2016. It was shorter than most others as they were already competitive, making the playoffs in 2017.

Data Gathering

There are two main sources of baseball data available online: and . This study will take from baseball-reference. Data will be taken from baseball-reference to show how data can be important via CSV files and via the internet. Ten years of data will be taken from 2008 to 2008 since the 2018 season has just begun. The 2018 year will prove to be an interesting addition as every team, with maybe the exception of the Cubs, are off to great starts. Even then, Cubs are doing well, but not as well as one might expect.

There are a number of ways to import datasets into R, whether it by from an SQL database, CSV, or from an html webpage. Baseball-reference allows you to easily download CSVs, which you can put into your working directory. In the following example, we’ve taken the batting and pitching general statistics CSVs from the 2008 Chicago Cubs. The files were downloaded as xls files and converted to csv files using Excel. Using read_csv from the tidyverse makes this really easy.

chc_08_bat <- read_csv("sportsref_chc_08_bat.xls.csv")
chc_08_pitch <- read_csv("sportsref_chc_08_pitch.xls.csv")
head(chc_08_pitch,10)

Easy, right? One command, one dataframe. Parsing from a website is a bit trickier as you have to locate where in the source html the table you want is stored. However, it does come with the benefit of not needing to download extra files. If we were to get all our data from downloaded files, we would have 100 CSVs. 10 years and five teams with two tables per adds up quickly. The following example shows how we can extract those tables directly from baseball-reference and into our program using the 2008 Houston Astros.

url_stros_08 <- "https://www.baseball-reference.com/teams/HOU/2008.shtml"
hou_08_bat <- url_stros_08 %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
hou_08_pitch <- url_stros_08 %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
head(hou_08_bat,10)
##    Rk Pos           Name Age   G  PA  AB   R   H 2B 3B HR RBI SB CS BB  SO
## 1   1   C    Brad Ausmus  39  81 250 216  15  47  8  0  3  24  0  2 25  41
## 2   2  1B Lance Berkman#  32 159 665 554 114 173 46  4 29 106 18  4 99 108
## 3   3  2B  Kazuo Matsui#  32  96 422 375  58 110 26  3  6  33 20  5 37  53
## 4   4  SS  Miguel Tejada  34 158 666 632  92 179 38  3 13  66  7  7 24  72
## 5   5  3B   Ty Wigginton  30 111 429 386  50 110 22  1 23  58  4  6 32  69
## 6   6  LF     Carlos Lee  32 115 481 436  61 137 27  0 28 100  4  1 37  49
## 7   7  CF Michael Bourn*  25 138 514 467  57 107 10  4  5  29 41 10 37 111
## 8   8  RF   Hunter Pence  25 157 642 595  78 160 34  4 25  83 11 10 40 124
## 9  Rk Pos           Name Age   G  PA  AB   R   H 2B 3B HR RBI SB CS BB  SO
## 10  9  3B    Geoff Blum#  35 114 356 325  36  78 14  1 14  53  1  2 21  54
##      BA  OBP  SLG  OPS OPS+  TB GDP HBP SH SF IBB
## 1  .218 .303 .296 .600   61  64   4   2  6  1   3
## 2  .312 .420 .567 .986  160 314  13   7  0  5  18
## 3  .293 .354 .427 .781  107 160   3   0  7  3   0
## 4  .283 .314 .415 .729   92 262  32   6  1  3   4
## 5  .285 .350 .526 .876  129 203   9   8  0  3   1
## 6  .314 .368 .569 .937  145 248   8   3  0  5   7
## 7  .229 .288 .300 .588   57 140   3   2  7  1   0
## 8  .269 .318 .466 .783  105 277  14   4  0  3   2
## 9    BA  OBP  SLG  OPS OPS+  TB GDP HBP SH SF IBB
## 10 .240 .287 .418 .705   85 136   5   3  0  7   2

Using the rvest package, we see that importing data in this way is pretty simple too. The hardest part is looking through the html source code to find where the table is located. Fortunately, baseball-reference has made this easy for us. Sometimes, websites aren’t very good at stroing this tables, like this example which has the table hardcoded in as plaintext. In those cases, parsing the data gets a little more tricky, and regex comes into handy. Regex parsing will be covered in the next section.

One other thing to note in this example is the %>%. This is called pipe operator and it allows us to call multiple functions on a single dataset at once. We will use this many times when working with our data.

The following code obotains the rest of the data we need from the remaining pages, split up by team

url_stros <- "https://www.baseball-reference.com/teams/HOU/2009.shtml"
hou_09_bat <- url_stros %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
hou_09_pitch <- url_stros %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_stros <- "https://www.baseball-reference.com/teams/HOU/2010.shtml"
hou_10_bat <- url_stros %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
hou_10_pitch <- url_stros %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_stros <- "https://www.baseball-reference.com/teams/HOU/2011.shtml"
hou_11_bat <- url_stros %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
hou_11_pitch <- url_stros %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_stros <- "https://www.baseball-reference.com/teams/HOU/2012.shtml"
hou_12_bat <- url_stros %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
hou_12_pitch <- url_stros %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_stros <- "https://www.baseball-reference.com/teams/HOU/2013.shtml"
hou_13_bat <- url_stros %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
hou_13_pitch <- url_stros %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_stros <- "https://www.baseball-reference.com/teams/HOU/2014.shtml"
hou_14_bat <- url_stros %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
hou_14_pitch <- url_stros %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_stros <- "https://www.baseball-reference.com/teams/HOU/2015.shtml"
hou_15_bat <- url_stros %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
hou_15_pitch <- url_stros %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_stros <- "https://www.baseball-reference.com/teams/HOU/2016.shtml"
hou_16_bat <- url_stros %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
hou_16_pitch <- url_stros %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_stros <- "https://www.baseball-reference.com/teams/HOU/2017.shtml"
hou_17_bat <- url_stros %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
hou_17_pitch <- url_stros %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()

Chicago Cubs:

url_chc <- "https://www.baseball-reference.com/teams/CHC/2009.shtml"
chc_09_bat <- url_chc %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
chc_09_pitch <- url_chc %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_chc <- "https://www.baseball-reference.com/teams/CHC/2010.shtml"
chc_10_bat <- url_chc %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
chc_10_pitch <- url_chc %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_chc <- "https://www.baseball-reference.com/teams/CHC/2011.shtml"
chc_11_bat <- url_chc %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
chc_11_pitch <- url_chc %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_chc <- "https://www.baseball-reference.com/teams/CHC/2012.shtml"
chc_12_bat <- url_chc %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
chc_12_pitch <- url_chc %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_chc <- "https://www.baseball-reference.com/teams/CHC/2013.shtml"
chc_13_bat <- url_chc %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
chc_13_pitch <- url_chc %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_chc <- "https://www.baseball-reference.com/teams/CHC/2014.shtml"
chc_14_bat <- url_chc %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
chc_14_pitch <- url_chc %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_chc <- "https://www.baseball-reference.com/teams/CHC/2015.shtml"
chc_15_bat <- url_chc %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
chc_15_pitch <- url_chc %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_chc <- "https://www.baseball-reference.com/teams/CHC/2016.shtml"
chc_16_bat <- url_chc %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
chc_16_pitch <- url_chc %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_chc <- "https://www.baseball-reference.com/teams/CHC/2017.shtml"
chc_17_bat <- url_chc %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
chc_17_pitch <- url_chc %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()

Philadelphia Phillies

url_phi <- "https://www.baseball-reference.com/teams/PHI/2008.shtml"
phi_08_bat <- url_phi %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
phi_08_pitch <- url_phi %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_phi <- "https://www.baseball-reference.com/teams/PHI/2009.shtml"
phi_09_bat <- url_phi %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
phi_09_pitch <- url_phi %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_phi <- "https://www.baseball-reference.com/teams/PHI/2010.shtml"
phi_10_bat <- url_phi %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
phi_10_pitch <- url_phi %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_phi <- "https://www.baseball-reference.com/teams/PHI/2011.shtml"
phi_11_bat <- url_phi %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
phi_11_pitch <- url_phi %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_phi <- "https://www.baseball-reference.com/teams/PHI/2012.shtml"
phi_12_bat <- url_phi %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
phi_12_pitch <- url_phi %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_phi <- "https://www.baseball-reference.com/teams/PHI/2013.shtml"
phi_13_bat <- url_phi %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
phi_13_pitch <- url_phi %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_phi <- "https://www.baseball-reference.com/teams/PHI/2014.shtml"
phi_14_bat <- url_phi %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
phi_14_pitch <- url_phi %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_phi <- "https://www.baseball-reference.com/teams/PHI/2015.shtml"
phi_15_bat <- url_phi %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
phi_15_pitch <- url_phi %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_phi <- "https://www.baseball-reference.com/teams/PHI/2016.shtml"
phi_16_bat <- url_phi %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
phi_16_pitch <- url_phi %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_phi <- "https://www.baseball-reference.com/teams/PHI/2017.shtml"
phi_17_bat <- url_phi %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
phi_17_pitch <- url_phi %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()

New York Yankees

url_nyy <- "https://www.baseball-reference.com/teams/NYY/2008.shtml"
nyy_08_bat <- url_nyy %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
nyy_08_pitch <- url_nyy %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_nyy <- "https://www.baseball-reference.com/teams/NYY/2009.shtml"
nyy_09_bat <- url_nyy %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
nyy_09_pitch <- url_nyy %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_nyy <- "https://www.baseball-reference.com/teams/NYY/2010.shtml"
nyy_10_bat <- url_nyy %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
nyy_10_pitch <- url_nyy %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_nyy <- "https://www.baseball-reference.com/teams/NYY/2011.shtml"
nyy_11_bat <- url_nyy %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
nyy_11_pitch <- url_nyy %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_nyy <- "https://www.baseball-reference.com/teams/NYY/2012.shtml"
nyy_12_bat <- url_nyy %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
nyy_12_pitch <- url_nyy %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_nyy <- "https://www.baseball-reference.com/teams/NYY/2013.shtml"
nyy_13_bat <- url_nyy %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
nyy_13_pitch <- url_nyy %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_nyy <- "https://www.baseball-reference.com/teams/NYY/2014.shtml"
nyy_14_bat <- url_nyy %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
nyy_14_pitch <- url_nyy %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_nyy <- "https://www.baseball-reference.com/teams/NYY/2015.shtml"
nyy_15_bat <- url_nyy %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
nyy_15_pitch <- url_nyy %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_nyy <- "https://www.baseball-reference.com/teams/NYY/2016.shtml"
nyy_16_bat <- url_nyy %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
nyy_16_pitch <- url_nyy %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_nyy <- "https://www.baseball-reference.com/teams/NYY/2017.shtml"
nyy_17_bat <- url_nyy %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
nyy_17_pitch <- url_nyy %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()

Atlanta Braves

url_atl <- "https://www.baseball-reference.com/teams/ATL/2008.shtml"
atl_08_bat <- url_atl %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
atl_08_pitch <- url_atl %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_atl <- "https://www.baseball-reference.com/teams/ATL/2009.shtml"
atl_09_bat <- url_atl %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
atl_09_pitch <- url_atl %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_atl <- "https://www.baseball-reference.com/teams/ATL/2010.shtml"
atl_10_bat <- url_atl %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
atl_10_pitch <- url_atl %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_atl <- "https://www.baseball-reference.com/teams/ATL/2011.shtml"
atl_11_bat <- url_atl %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
atl_11_pitch <- url_atl %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_atl <- "https://www.baseball-reference.com/teams/ATL/2012.shtml"
atl_12_bat <- url_atl %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
atl_12_pitch <- url_atl %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_atl <- "https://www.baseball-reference.com/teams/ATL/2013.shtml"
atl_13_bat <- url_atl %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
atl_13_pitch <- url_atl %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_atl <- "https://www.baseball-reference.com/teams/ATL/2014.shtml"
atl_14_bat <- url_atl %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
atl_14_pitch <- url_atl %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_atl <- "https://www.baseball-reference.com/teams/ATL/2015.shtml"
atl_15_bat <- url_atl %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
atl_15_pitch <- url_atl %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_atl <- "https://www.baseball-reference.com/teams/ATL/2016.shtml"
atl_16_bat <- url_atl %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
atl_16_pitch <- url_atl %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()
url_atl <- "https://www.baseball-reference.com/teams/ATL/2017.shtml"
atl_17_bat <- url_atl %>%
  read_html() %>%
  html_node("#team_batting") %>%
  html_table()
atl_17_pitch <- url_atl %>%
  read_html() %>%
  html_node("#team_pitching") %>%
  html_table()

Parsing

Fortunately for us, all this data comes in pretty clean, but we still need to fix it up and make it pretty. For example, in all the dataframes, except for the one we imported as a CSV, repeat the attribute names, so we’ll get rid of that. Additionally, most of the attributes are integers, but because of that pesky last line, they are listed as characters. To make matters worse, this line appears in the middle of each frame at seemingly random locations. The following example fixes this for the 2008 Atlanta Braves. The rest are fixed at the end of the example.

# This command will remove any row with an attribute in the Rk column with the value Rk
atl_08_bat <- atl_08_bat[!(atl_08_bat$Rk=="Rk"),]
atl_08_pitch <- atl_08_pitch[!(atl_08_pitch$Rk=="Rk"),]
# This will set each column to the proper type. The sapply applies the function to the specified columns
atl_08_bat[, c(1,4:28)] <- sapply(atl_08_bat[, c(1,4:28)], as.numeric)
atl_08_pitch[, c(1,4:34)] <- sapply(atl_08_pitch[, c(1,4:34)], as.numeric)
head(atl_08_pitch,10)
##    Rk Pos           Name Age  W  L  W-L%  ERA  G GS GF CG SHO SV    IP   H
## 1   1  SP  Jair Jurrjens  22 13 10 0.565 3.68 31 31  0  0   0  0 188.1 188
## 2   2  SP Jorge Campillo  29  8  7 0.533 3.91 39 25  3  1   0  0 158.2 158
## 3   3  SP     Tim Hudson  32 11  7 0.611 3.17 23 22  0  1   1  0 142.0 125
## 4   4  SP   Jo-Jo Reyes*  23  3 11 0.214 5.81 23 22  0  0   0  0 113.0 134
## 5   5  SP  Mike Hampton*  35  3  4 0.429 4.85 13 13  0  0   0  0  78.0  83
## 6   6  SP Charlie Morton  24  4  8 0.333 6.15 16 15  0  0   0  0  74.2  80
## 8   7  CL Mike Gonzalez*  30  0  3 0.000 4.28 36  0 29  0   0 14  33.2  26
## 9   8  RP   Jeff Bennett  28  3  7 0.300 3.70 72  4 12  0   0  3  97.1  86
## 10  9  RP   Blaine Boyer  26  2  6 0.250 5.88 76  0 18  0   0  1  72.0  73
## 11 10  RP    Will Ohman*  30  4  1 0.800 3.68 83  0 16  0   0  1  58.2  51
##     R ER HR BB IBB  SO HBP BK WP  BF ERA+  FIP  WHIP   H9 HR9 BB9  SO9
## 1  87 77 11 70   9 139   4  0  3 813  113 3.59 1.370  9.0 0.5 3.3  6.6
## 2  74 69 18 38   2 107   1  0  2 655  107 4.00 1.235  9.0 1.0 2.2  6.1
## 3  53 50 11 40   5  85   2  1  3 573  132 3.83 1.162  7.9 0.7 2.5  5.4
## 4  77 73 18 52   4  78   3  0  2 512   72 5.28 1.646 10.7 1.4 4.1  6.2
## 5  45 42 10 28   6  38   1  0  0 331   86 4.94 1.423  9.6 1.2 3.2  4.4
## 6  56 51  9 41   2  48   2  0  2 345   68 5.14 1.621  9.6 1.1 4.9  5.8
## 8  21 16  6 14   3  44   1  0  0 142   98 4.17 1.188  7.0 1.6 3.7 11.8
## 9  44 40  5 47   6  68   7  0  5 419  113 4.07 1.366  8.0 0.5 4.3  6.3
## 10 51 47 10 25   4  67   2  0  2 313   71 4.20 1.361  9.1 1.3 3.1  8.4
## 11 27 24  3 22   4  53   1  0  2 248  114 3.17 1.244  7.8 0.5 3.4  8.1
##    SO/W
## 1  1.99
## 2  2.82
## 3  2.13
## 4  1.50
## 5  1.36
## 6  1.17
## 8  3.14
## 9  1.45
## 10 2.68
## 11 2.41

One other thing we need to do clean this data is with the names. For whatever reason, baseball-reference decided to use characters next to a person’s name to show what side of the plate they bat from. No symbol means they bat right handed, a * means they bat left handed and a # means they bat from both sides of the plate. This blends a little into our next section, data management, because we are going to create a new attribute called bat_side that will contain a single character: R for righties, L for lefties, and S for switch hitters. We will also get rid of that symbol on the person’s name. There is also one for the pitchers, but here we won’t have to deal with switch pitchers as Pat Vendette did not play for any of these teams on during this era.

This is a good opportunity to introduce regex, which is contained in the stringr package for R. Here’s a regex cheat sheet for R. By using regex, we can check through each string and parse out the important bits and discard the rest.

atl_08_bat <- atl_08_bat %>%
  mutate(bat_side=ifelse(str_detect(Name,"\\*"),'L', ifelse(str_detect(Name,"#"),'S','R'))) %>%
  mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_08_pitch <- atl_08_pitch %>%
  mutate(pitch_side=ifelse(str_detect(Name,"\\*"),'L', 'R')) %>%
  mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
head(atl_08_bat,10)
##    Rk Pos           Name Age   G  PA  AB  R   H 2B 3B HR RBI SB CS BB  SO
## 1   1   C   Brian McCann  24 145 573 509 68 153 42  1 23  87  5  0 57  64
## 2   2  1B  Mark Teixeira  28 103 451 381 63 108 27  0 20  78  0  0 65  70
## 3   3  2B  Kelly Johnson  26 150 614 547 86 157 39  6 12  69 11  6 52 113
## 4   4  SS  Yunel Escobar  25 136 587 514 71 148 24  2 10  60  2  5 59  62
## 5   5  3B  Chipper Jones  36 128 534 439 82 160 24  1 22  75  4  0 90  61
## 6   6  LF  Gregor Blanco  24 144 519 430 52 108 14  4  1  38 13  5 74  99
## 7   7  CF    Mark Kotsay  32  88 345 318 39  92 17  3  6  37  2  3 25  34
## 8   8  RF Jeff Francoeur  24 155 653 599 70 143 33  3 11  71  0  1 39 111
## 9   9  UT   Omar Infante  26  96 348 317 45  93 24  3  3  40  0  1 22  44
## 10 10  IF   Martin Prado  24  78 254 228 36  73 18  4  2  33  3  1 21  29
##       BA   OBP   SLG   OPS OPS+  TB GDP HBP SH SF IBB bat_side
## 1  0.301 0.373 0.523 0.896  135 266  17   4  0  3   4        L
## 2  0.283 0.390 0.512 0.902  137 195  13   3  0  2   9        S
## 3  0.287 0.349 0.446 0.795  109 244   3   2  9  4   2        L
## 4  0.288 0.366 0.401 0.766  103 206  24   5  7  2   4        R
## 5  0.364 0.470 0.574 1.044  176 252  13   1  0  4  16        S
## 6  0.251 0.366 0.309 0.676   82 133   3   6  6  3   2        L
## 7  0.289 0.340 0.418 0.758  100 133  13   0  1  1   2        L
## 8  0.239 0.294 0.359 0.653   72 215  18  10  0  4   5        R
## 9  0.293 0.338 0.416 0.755   99 132   4   2  2  5   2        R
## 10 0.320 0.377 0.461 0.838  121 105   3   1  2  2   0        R

So once you understand regex, things become pretty easy. Now we speed up the process, doing this for our 100 dataframes as seen below for your viewing pleasure.

atl_09_bat <- atl_09_bat[!(atl_09_bat$Rk=="Rk"),]
atl_09_pitch <- atl_09_pitch[!(atl_09_pitch$Rk=="Rk"),]
atl_09_bat[, c(1,4:28)] <- sapply(atl_09_bat[, c(1,4:28)], as.numeric)
atl_09_pitch[, c(1,4:34)] <- sapply(atl_09_pitch[, c(1,4:34)], as.numeric)
atl_09_bat <- atl_09_bat %>%
  mutate(bat_side=ifelse(str_detect(Name,"\\*"),'L', ifelse(str_detect(Name,"#"),'S','R'))) %>%
  mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_09_pitch <- atl_09_pitch %>%
  mutate(pitch_side=ifelse(str_detect(Name,"\\*"),'L', 'R')) %>%
  mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))

atl_10_bat <- atl_10_bat[!(atl_10_bat$Rk=="Rk"),]
atl_10_pitch <- atl_10_pitch[!(atl_10_pitch$Rk=="Rk"),]
atl_10_bat[, c(1,4:28)] <- sapply(atl_10_bat[, c(1,4:28)], as.numeric)
atl_10_pitch[, c(1,4:34)] <- sapply(atl_10_pitch[, c(1,4:34)], as.numeric)
atl_10_bat <- atl_10_bat %>%
  mutate(bat_side=ifelse(str_detect(Name,"\\*"),'L', ifelse(str_detect(Name,"#"),'S','R'))) %>%
  mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_10_pitch <- atl_10_pitch %>%
  mutate(pitch_side=ifelse(str_detect(Name,"\\*"),'L', 'R')) %>%
  mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))

atl_11_bat <- atl_11_bat[!(atl_11_bat$Rk=="Rk"),]
atl_11_pitch <- atl_11_pitch[!(atl_11_pitch$Rk=="Rk"),]
atl_11_bat[, c(1,4:28)] <- sapply(atl_11_bat[, c(1,4:28)], as.numeric)
atl_11_pitch[, c(1,4:34)] <- sapply(atl_11_pitch[, c(1,4:34)], as.numeric)
atl_11_bat <- atl_11_bat %>%
  mutate(bat_side=ifelse(str_detect(Name,"\\*"),'L', ifelse(str_detect(Name,"#"),'S','R'))) %>%
  mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_11_pitch <- atl_11_pitch %>%
  mutate(pitch_side=ifelse(str_detect(Name,"\\*"),'L', 'R')) %>%
  mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))

atl_12_bat <- atl_12_bat[!(atl_12_bat$Rk=="Rk"),]
atl_12_pitch <- atl_12_pitch[!(atl_12_pitch$Rk=="Rk"),]
atl_12_bat[, c(1,4:28)] <- sapply(atl_12_bat[, c(1,4:28)], as.numeric)
atl_12_pitch[, c(1,4:34)] <- sapply(atl_12_pitch[, c(1,4:34)], as.numeric)
atl_12_bat <- atl_12_bat %>%
  mutate(bat_side=ifelse(str_detect(Name,"\\*"),'L', ifelse(str_detect(Name,"#"),'S','R'))) %>%
  mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_12_pitch <- atl_12_pitch %>%
  mutate(pitch_side=ifelse(str_detect(Name,"\\*"),'L', 'R')) %>%
  mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))

atl_13_bat <- atl_13_bat[!(atl_13_bat$Rk=="Rk"),]
atl_13_pitch <- atl_13_pitch[!(atl_13_pitch$Rk=="Rk"),]
atl_13_bat[, c(1,4:28)] <- sapply(atl_13_bat[, c(1,4:28)], as.numeric)
atl_13_pitch[, c(1,4:34)] <- sapply(atl_13_pitch[, c(1,4:34)], as.numeric)
atl_13_bat <- atl_13_bat %>%
  mutate(bat_side=ifelse(str_detect(Name,"\\*"),'L', ifelse(str_detect(Name,"#"),'S','R'))) %>%
  mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_13_pitch <- atl_13_pitch %>%
  mutate(pitch_side=ifelse(str_detect(Name,"\\*"),'L', 'R')) %>%
  mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))

atl_14_bat <- atl_14_bat[!(atl_14_bat$Rk=="Rk"),]
atl_14_pitch <- atl_14_pitch[!(atl_14_pitch$Rk=="Rk"),]
atl_14_bat[, c(1,4:28)] <- sapply(atl_14_bat[, c(1,4:28)], as.numeric)
atl_14_pitch[, c(1,4:34)] <- sapply(atl_14_pitch[, c(1,4:34)], as.numeric)
atl_14_bat <- atl_14_bat %>%
  mutate(bat_side=ifelse(str_detect(Name,"\\*"),'L', ifelse(str_detect(Name,"#"),'S','R'))) %>%
  mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_14_pitch <- atl_14_pitch %>%
  mutate(pitch_side=ifelse(str_detect(Name,"\\*"),'L', 'R')) %>%
  mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))

atl_15_bat <- atl_15_bat[!(atl_15_bat$Rk=="Rk"),]
atl_15_pitch <- atl_15_pitch[!(atl_15_pitch$Rk=="Rk"),]
atl_15_bat[, c(1,4:28)] <- sapply(atl_15_bat[, c(1,4:28)], as.numeric)
atl_15_pitch[, c(1,4:34)] <- sapply(atl_15_pitch[, c(1,4:34)], as.numeric)
atl_15_bat <- atl_15_bat %>%
  mutate(bat_side=ifelse(str_detect(Name,"\\*"),'L', ifelse(str_detect(Name,"#"),'S','R'))) %>%
  mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_15_pitch <- atl_15_pitch %>%
  mutate(pitch_side=ifelse(str_detect(Name,"\\*"),'L', 'R')) %>%
  mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))

atl_16_bat <- atl_16_bat[!(atl_16_bat$Rk=="Rk"),]
atl_16_pitch <- atl_10_pitch[!(atl_16_pitch$Rk=="Rk"),]
atl_16_bat[, c(1,4:28)] <- sapply(atl_16_bat[, c(1,4:28)], as.numeric)
atl_16_pitch[, c(1,4:34)] <- sapply(atl_16_pitch[, c(1,4:34)], as.numeric)
atl_16_bat <- atl_16_bat %>%
  mutate(bat_side=ifelse(str_detect(Name,"\\*"),'L', ifelse(str_detect(Name,"#"),'S','R'))) %>%
  mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_16_pitch <- atl_16_pitch %>%
  mutate(pitch_side=ifelse(str_detect(Name,"\\*"),'L', 'R')) %>%
  mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))

atl_17_bat <- atl_10_bat[!(atl_17_bat$Rk=="Rk"),]
atl_17_pitch <- atl_10_pitch[!(atl_17_pitch$Rk=="Rk"),]
atl_17_bat[, c(1,4:28)] <- sapply(atl_17_bat[, c(1,4:28)], as.numeric)
atl_17_pitch[, c(1,4:34)] <- sapply(atl_17_pitch[, c(1,4:34)], as.numeric)
atl_17_bat <- atl_17_bat %>%
  mutate(bat_side=ifelse(str_detect(Name,"\\*"),'L', ifelse(str_detect(Name,"#"),'S','R'))) %>%
  mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_17_pitch <- atl_17_pitch %>%
  mutate(pitch_side=ifelse(str_detect(Name,"\\*"),'L', 'R')) %>%
  mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))

The other teams’ code to parse the data is hidden just so this document isn’t obscenely long, but it was proccessed.

Data Management

Ok, so now we have a metric ton of data that has been cleaned up, what can we do with it? Well, actually, we can do a lot. For example, we could calculate a player’s WAR, or Wins Above Replacement. WAR is a calculation of how good a player is over whatever Joe Schmoe would come up to replace him. It’s an incredibly useful number, but rather difficult to calculate. In fact, both baseball-reference and fangraphs calculate it in different ways. So instead, let’s keep things simple. First, let’s clean up our data.

If you notice, all of the dataframes have the last couple rows as some summary statistics. The final entry is all the attributes re-listed, but we already got rid of that. We’ll want to summary statistics, so let’s stash that away temporarily and see what we can do with the clean dataframe. We can use the slice command to show which ones we don’t want. As before, we’ll look at just one team, this time the 2008 Philadelphia Phillies.

phi_08_a <- slice(phi_08_bat,(nrow(phi_08_bat)-3))
phi_08_b <- slice(phi_08_pitch,(nrow(phi_08_pitch)-1))
phi_08 <- merge(phi_08_a,phi_08_b,by="Name",all=TRUE)
phi_08_bat <- slice(phi_08_bat, 1:(nrow(phi_08_bat)-4))
head(phi_08_bat,10)
## # A tibble: 10 x 29
##       Rk Pos   Name    Age     G    PA    AB     R     H  `2B`  `3B`    HR
##    <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1.00 C     Carl~  29.0 117     373   320  47.0  70.0  14.0  0     4.00
##  2  2.00 1B    Ryan~  28.0 162     700   610 105   153    26.0  4.00 48.0 
##  3  3.00 2B    Chas~  29.0 159     707   607 113   177    41.0  4.00 33.0 
##  4  4.00 SS    Jimm~  29.0 137     625   556  76.0 154    38.0  9.00 11.0 
##  5  5.00 3B    Pedr~  33.0 133     463   425  43.0 106    19.0  2.00 14.0 
##  6  6.00 LF    Pat ~  31.0 157     645   536  74.0 134    33.0  3.00 33.0 
##  7  7.00 CF    Shan~  27.0 146     627   570 102   167    30.0  8.00 14.0 
##  8  8.00 RF    Jays~  29.0 134     482   418  73.0 114    16.0  3.00 24.0 
##  9  9.00 RF    Geof~  33.0 115     322   293  27.0  72.0  16.0  0     9.00
## 10 10.0  C     Chri~  35.0  98.0   305   274  28.0  72.0  17.0  0     9.00
## # ... with 17 more variables: RBI <dbl>, SB <dbl>, CS <dbl>, BB <dbl>,
## #   SO <dbl>, BA <dbl>, OBP <dbl>, SLG <dbl>, OPS <dbl>, `OPS+` <dbl>,
## #   TB <dbl>, GDP <dbl>, HBP <dbl>, SH <dbl>, SF <dbl>, IBB <dbl>,
## #   bat_side <chr>

Alternatively, we can filter them out using filter

phi_08_pitch <- filter(phi_08_pitch, phi_08_pitch$Rk!='NA')
head(phi_08_pitch,10)
##    Rk Pos          Name Age  W  L  W-L%  ERA  G GS GF CG SHO SV    IP   H
## 1   1  SP   Cole Hamels  24 14 10 0.583 3.09 33 33  0  2   2  0 227.1 193
## 2   2  SP   Jamie Moyer  45 16  7 0.696 3.71 33 33  0  0   0  0 196.1 199
## 3   3  SP   Brett Myers  27 10 13 0.435 4.55 30 30  0  2   1  0 190.0 197
## 4   4  SP Kyle Kendrick  23 11  9 0.550 5.49 31 30  1  0   0  0 155.2 194
## 5   5  SP    Adam Eaton  30  4  8 0.333 5.80 21 19  0  0   0  0 107.0 131
## 6   6  SP   Joe Blanton  27  4  0 1.000 4.20 13 13  0  0   0  0  70.2  66
## 7   7  CL    Brad Lidge  31  2  0 1.000 1.95 72  0 61  0   0 41  69.1  50
## 8   8  RP   Chad Durbin  30  5  4 0.556 2.87 71  0 12  0   0  1  87.2  81
## 9   9  RP   Ryan Madson  27  4  2 0.667 3.05 76  0 14  0   0  1  82.2  79
## 10 10  RP  Clay Condrey  32  3  4 0.429 3.26 56  0 30  0   0  1  69.0  85
##      R ER HR BB IBB  SO HBP BK WP  BF ERA+  FIP  WHIP   H9 HR9 BB9  SO9
## 1   89 78 28 53   7 196   1  0  0 914  141 3.72 1.082  7.6 1.1 2.1  7.8
## 2   85 81 20 62   4 123  11  0  3 841  117 4.32 1.329  9.1 0.9 2.8  5.6
## 3  103 96 29 65   6 163   6  0  2 817   96 4.52 1.379  9.3 1.4 3.1  7.7
## 4  103 95 23 57   2  68  14  1  4 722   79 5.55 1.612 11.2 1.3 3.3  3.9
## 5   71 69 15 44   5  57   6  0  2 478   75 5.29 1.636 11.0 1.3 3.7  4.8
## 6   36 33 10 31   0  49   3  0  1 305  104 5.03 1.373  8.4 1.3 3.9  6.2
## 7   17 15  2 35   4  92   1  0  5 292  224 2.41 1.226  6.5 0.3 4.5 11.9
## 8   33 28  5 35   7  63   4  0  3 365  152 3.77 1.323  8.3 0.5 3.6  6.5
## 9   29 28  6 23   4  67   1  1  2 340  143 3.33 1.234  8.6 0.7 2.5  7.3
## 10  26 25  6 19   8  34   2  0  1 303  134 4.19 1.507 11.1 0.8 2.5  4.4
##    SO/W pitch_side
## 1  3.70          L
## 2  1.98          L
## 3  2.51          R
## 4  1.19          R
## 5  1.30          R
## 6  1.58          R
## 7  2.63          R
## 8  1.80          R
## 9  2.91          R
## 10 1.79          R

Say we decide that slugging percentage is useless, we can get rid of that using select

phi_08_bat <- select(phi_08_bat, -SLG)
head(phi_08_bat,10)
## # A tibble: 10 x 28
##       Rk Pos   Name    Age     G    PA    AB     R     H  `2B`  `3B`    HR
##    <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1.00 C     Carl~  29.0 117     373   320  47.0  70.0  14.0  0     4.00
##  2  2.00 1B    Ryan~  28.0 162     700   610 105   153    26.0  4.00 48.0 
##  3  3.00 2B    Chas~  29.0 159     707   607 113   177    41.0  4.00 33.0 
##  4  4.00 SS    Jimm~  29.0 137     625   556  76.0 154    38.0  9.00 11.0 
##  5  5.00 3B    Pedr~  33.0 133     463   425  43.0 106    19.0  2.00 14.0 
##  6  6.00 LF    Pat ~  31.0 157     645   536  74.0 134    33.0  3.00 33.0 
##  7  7.00 CF    Shan~  27.0 146     627   570 102   167    30.0  8.00 14.0 
##  8  8.00 RF    Jays~  29.0 134     482   418  73.0 114    16.0  3.00 24.0 
##  9  9.00 RF    Geof~  33.0 115     322   293  27.0  72.0  16.0  0     9.00
## 10 10.0  C     Chri~  35.0  98.0   305   274  28.0  72.0  17.0  0     9.00
## # ... with 16 more variables: RBI <dbl>, SB <dbl>, CS <dbl>, BB <dbl>,
## #   SO <dbl>, BA <dbl>, OBP <dbl>, OPS <dbl>, `OPS+` <dbl>, TB <dbl>,
## #   GDP <dbl>, HBP <dbl>, SH <dbl>, SF <dbl>, IBB <dbl>, bat_side <chr>

It can also be used to select the attributes we find useful.

As it turns out, slugging percentage is actually a really useful stat, so we need to add it back in. Fortunately, we have enough data to be able to recalculate it. Slugging percentage is calculated as: \[ \frac{\text{# of Singles} \times 1 + \text{# of Doubles} \times 2 + \text{# of Triples} \times 3 + \text{# of Homeruns} \times 4}{\text{# of At-Bats}}\]

So we can insert in a way we’ve already seen:

phi_08_bat <- phi_08_bat %>%
  mutate(SLG = ((H-`2B`-`3B`-HR)+`2B`*2+`3B`*3+HR*4)/AB)
head(phi_08_bat,10)
## # A tibble: 10 x 29
##       Rk Pos   Name    Age     G    PA    AB     R     H  `2B`  `3B`    HR
##    <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1.00 C     Carl~  29.0 117     373   320  47.0  70.0  14.0  0     4.00
##  2  2.00 1B    Ryan~  28.0 162     700   610 105   153    26.0  4.00 48.0 
##  3  3.00 2B    Chas~  29.0 159     707   607 113   177    41.0  4.00 33.0 
##  4  4.00 SS    Jimm~  29.0 137     625   556  76.0 154    38.0  9.00 11.0 
##  5  5.00 3B    Pedr~  33.0 133     463   425  43.0 106    19.0  2.00 14.0 
##  6  6.00 LF    Pat ~  31.0 157     645   536  74.0 134    33.0  3.00 33.0 
##  7  7.00 CF    Shan~  27.0 146     627   570 102   167    30.0  8.00 14.0 
##  8  8.00 RF    Jays~  29.0 134     482   418  73.0 114    16.0  3.00 24.0 
##  9  9.00 RF    Geof~  33.0 115     322   293  27.0  72.0  16.0  0     9.00
## 10 10.0  C     Chri~  35.0  98.0   305   274  28.0  72.0  17.0  0     9.00
## # ... with 17 more variables: RBI <dbl>, SB <dbl>, CS <dbl>, BB <dbl>,
## #   SO <dbl>, BA <dbl>, OBP <dbl>, OPS <dbl>, `OPS+` <dbl>, TB <dbl>,
## #   GDP <dbl>, HBP <dbl>, SH <dbl>, SF <dbl>, IBB <dbl>, bat_side <chr>,
## #   SLG <dbl>

There is a whole bunch of other things we can do, like renaming attributes, but what we are most interested in is the summary of each dataset. We start by merging datasets together by certain attributes. So instead of having two datasets per team per year, we can create one dataset for a team in a year.

phi_08_c <- merge(phi_08_bat,phi_08_pitch,by="Name",all=TRUE)
head(phi_08_c,10)
##              Name Rk.x Pos.x Age.x G.x  PA  AB R.x H.x 2B 3B HR.x RBI SB
## 1      Adam Eaton   27     P    30  22  36  28   1   5  2  0    0   1  0
## 2      Andy Tracy   21          34   4   4   2   0   0  0  0    0   1  0
## 3     Brad Harman   16    IF    22   6  11  10   1   1  1  0    0   1  0
## 4      Brad Lidge   38     P    31  71   0   0   0   0  0  0    0   0  0
## 5     Brett Myers   24     P    27  30  71  58   3   4  1  0    0   1  0
## 6     Carlos Ruiz    1     C    29 117 373 320  47  70 14  0    4  31  1
## 7     Chad Durbin   29     P    30  69   9   9   0   1  0  0    0   0  0
## 8     Chase Utley    3    2B    29 159 707 607 113 177 41  4   33 104 14
## 9     Chris Coste   10     C    35  98 305 274  28  72 17  0    9  36  0
## 10 Chris Snelling   22          26   4   4   4   1   2  1  0    1   1  0
##    CS BB.x SO.x    BA   OBP   OPS OPS+  TB GDP HBP.x SH SF IBB.x bat_side
## 1   0    5    6 0.179 0.303 0.553   47   7   0     0  3  0     0        R
## 2   0    1    1 0.000 0.250 0.250  -27   0   0     0  0  1     0        L
## 3   0    1    1 0.100 0.182 0.382    0   2   1     0  0  0     0        R
## 4   0    0    0    NA    NA    NA   NA   0   0     0  0  0     0        R
## 5   0    4   19 0.069 0.129 0.215  -42   5   1     0  9  0     0        R
## 6   2   44   38 0.219 0.320 0.620   63  96  14     4  4  1     6        R
## 7   0    0    3 0.111 0.111 0.222  -42   1   0     0  0  0     0        R
## 8   2   64  104 0.292 0.380 0.915  136 325   9    27  1  8    14        L
## 9   1   16   51 0.263 0.325 0.748   93 116   7    10  3  2     1        R
## 10  0    0    0 0.500 0.500 2.000  395   6   0     0  0  0     0        L
##          SLG Rk.y Pos.y Age.y  W  L  W-L%  ERA G.y GS GF CG SHO SV    IP
## 1  0.2500000    5    SP    30  4  8 0.333 5.80  21 19  0  0   0  0 107.0
## 2  0.0000000   NA  <NA>    NA NA NA    NA   NA  NA NA NA NA  NA NA    NA
## 3  0.2000000   NA  <NA>    NA NA NA    NA   NA  NA NA NA NA  NA NA    NA
## 4        NaN    7    CL    31  2  0 1.000 1.95  72  0 61  0   0 41  69.1
## 5  0.0862069    3    SP    27 10 13 0.435 4.55  30 30  0  2   1  0 190.0
## 6  0.3000000   NA  <NA>    NA NA NA    NA   NA  NA NA NA NA  NA NA    NA
## 7  0.1111111    8    RP    30  5  4 0.556 2.87  71  0 12  0   0  1  87.2
## 8  0.5354201   NA  <NA>    NA NA NA    NA   NA  NA NA NA NA  NA NA    NA
## 9  0.4233577   NA  <NA>    NA NA NA    NA   NA  NA NA NA NA  NA NA    NA
## 10 1.5000000   NA  <NA>    NA NA NA    NA   NA  NA NA NA NA  NA NA    NA
##    H.y R.y ER HR.y BB.y IBB.y SO.y HBP.y BK WP  BF ERA+  FIP  WHIP   H9
## 1  131  71 69   15   44     5   57     6  0  2 478   75 5.29 1.636 11.0
## 2   NA  NA NA   NA   NA    NA   NA    NA NA NA  NA   NA   NA    NA   NA
## 3   NA  NA NA   NA   NA    NA   NA    NA NA NA  NA   NA   NA    NA   NA
## 4   50  17 15    2   35     4   92     1  0  5 292  224 2.41 1.226  6.5
## 5  197 103 96   29   65     6  163     6  0  2 817   96 4.52 1.379  9.3
## 6   NA  NA NA   NA   NA    NA   NA    NA NA NA  NA   NA   NA    NA   NA
## 7   81  33 28    5   35     7   63     4  0  3 365  152 3.77 1.323  8.3
## 8   NA  NA NA   NA   NA    NA   NA    NA NA NA  NA   NA   NA    NA   NA
## 9   NA  NA NA   NA   NA    NA   NA    NA NA NA  NA   NA   NA    NA   NA
## 10  NA  NA NA   NA   NA    NA   NA    NA NA NA  NA   NA   NA    NA   NA
##    HR9 BB9  SO9 SO/W pitch_side
## 1  1.3 3.7  4.8 1.30          R
## 2   NA  NA   NA   NA       <NA>
## 3   NA  NA   NA   NA       <NA>
## 4  0.3 4.5 11.9 2.63          R
## 5  1.4 3.1  7.7 2.51          R
## 6   NA  NA   NA   NA       <NA>
## 7  0.5 3.6  6.5 1.80          R
## 8   NA  NA   NA   NA       <NA>
## 9   NA  NA   NA   NA       <NA>
## 10  NA  NA   NA   NA       <NA>

Unfortunately, this gives us a bunch of diplicate attributes, but this won’t matter because we will just ignore them for our next step: summarizing. R makes summarizing really easy, as we show below. But first, there are some missing entries from the merge, we fill those in with zeros.

# Before we can summarize, we must do some cleaning.
phi_08_c[is.na(phi_08_c)] <- 0
# Now we can summarize
phi_08_c <- summarize(phi_08, Age=mean(Age.x), PA=sum(PA), AB=sum(AB), RS=sum(R.x), H=sum(H.x), Doubles=sum(`2B`), Triples=sum(`3B`), HR=sum(HR.x), BB=sum(BB.x), Batting_SO=sum(SO.x), W=sum(W), L=sum(L), IP=sum(IP), H_allowed=sum(H.y), RA=sum(ER), Pitching_SO=sum(SO.y), BB_allowed=sum(BB.y))
phi_08_c <- phi_08_c %>%
  mutate(Team = "PHI") %>%
  mutate(Year = 2008)
head(phi_08_c)
##   Age   PA   AB  RS    H Doubles Triples  HR  BB Batting_SO  W  L     IP
## 1  30 6273 5509 799 1407     291      36 214 586       1117 92 70 1449.2
##   H_allowed  RA Pitching_SO BB_allowed Team Year
## 1      1444 625        1081        533  PHI 2008

Now let’s compare to the original summary given by baseball-reference

head(phi_08)
##          Name Rk.x Pos.x Age.x G.x   PA   AB R.x  H.x  2B 3B HR.x RBI  SB
## 1 Team Totals   NA          30 162 6273 5509 799 1407 291 36  214 762 136
##   CS BB.x SO.x    BA   OBP   SLG  OPS OPS+   TB GDP HBP.x SH SF IBB.x
## 1 25  586 1117 0.255 0.332 0.438 0.77   99 2412 108    67 71 40    68
##   bat_side Rk.y Pos.y Age.y  W  L  W-L%  ERA G.y  GS  GF CG SHO SV     IP
## 1        R   NA        30.3 92 70 0.568 3.88 162 162 158  4   3 47 1449.2
##    H.y R.y  ER HR.y BB.y IBB.y SO.y HBP.y BK WP   BF ERA+ FIP  WHIP H9 HR9
## 1 1444 680 625  160  533    64 1081    57  3 34 6229  112 4.3 1.364  9   1
##   BB9 SO9 SO/W pitch_side
## 1 3.3 6.7 2.03          R

Now we can properly set up the rest of the Philly dataframes:

phi_08 <- select(phi_08, c(4:12,16:21,33:36,45,46,57))
phi_08 <- phi_08 %>%
  mutate(Team = "PHI") %>%
  mutate(Year = 2008)

phi_09 <- merge(filter(phi_09_bat,Name=="Team Totals"),filter(phi_09_pitch,Name=="Team Totals"),by="Name",all=TRUE)
phi_09 <- select(phi_09, c(4:12,16:21,33:36,45,46,57))
phi_09 <- phi_09 %>%
  mutate(Team = "PHI") %>%
  mutate(Year = 2009)

phi_10 <- merge(filter(phi_10_bat,Name=="Team Totals"),filter(phi_10_pitch,Name=="Team Totals"),by="Name",all=TRUE)
phi_10 <- select(phi_10, c(4:12,16:21,33:36,45,46,57))
phi_10 <- phi_10 %>%
  mutate(Team = "PHI") %>%
  mutate(Year = 2010)

phi_11 <- merge(filter(phi_11_bat,Name=="Team Totals"),filter(phi_11_pitch,Name=="Team Totals"),by="Name",all=TRUE)
phi_11 <- select(phi_11, c(4:12,16:21,33:36,45,46,57))
phi_11 <- phi_11 %>%
  mutate(Team = "PHI") %>%
  mutate(Year = 2011)

phi_12 <- merge(filter(phi_12_bat,Name=="Team Totals"),filter(phi_12_pitch,Name=="Team Totals"),by="Name",all=TRUE)
phi_12 <- select(phi_12, c(4:12,16:21,33:36,45,46,57))
phi_12 <- phi_12 %>%
  mutate(Team = "PHI") %>%
  mutate(Year = 2012)

phi_13 <- merge(filter(phi_13_bat,Name=="Team Totals"),filter(phi_13_pitch,Name=="Team Totals"),by="Name",all=TRUE)
phi_13 <- select(phi_13, c(4:12,16:21,33:36,45,46,57))
phi_13 <- phi_13 %>%
  mutate(Team = "PHI") %>%
  mutate(Year = 2013)

phi_14 <- merge(filter(phi_14_bat,Name=="Team Totals"),filter(phi_14_pitch,Name=="Team Totals"),by="Name",all=TRUE)
phi_14 <- select(phi_14, c(4:12,16:21,33:36,45,46,57))
phi_14 <- phi_14 %>%
  mutate(Team = "PHI") %>%
  mutate(Year = 2014)

phi_15 <- merge(filter(phi_15_bat,Name=="Team Totals"),filter(phi_15_pitch,Name=="Team Totals"),by="Name",all=TRUE)
phi_15 <- select(phi_15, c(4:12,16:21,33:36,45,46,57))
phi_15 <- phi_15 %>%
  mutate(Team = "PHI") %>%
  mutate(Year = 2015)

phi_16 <- merge(filter(phi_16_bat,Name=="Team Totals"),filter(phi_16_pitch,Name=="Team Totals"),by="Name",all=TRUE)
phi_16 <- select(phi_16, c(4:12,16:21,33:36,45,46,57))
phi_16 <- phi_16 %>%
  mutate(Team = "PHI") %>%
  mutate(Year = 2016)

phi_17 <- merge(filter(phi_17_bat,Name=="Team Totals"),filter(phi_17_pitch,Name=="Team Totals"),by="Name",all=TRUE)
phi_17 <- select(phi_17, c(4:12,16:21,33:36,45,46,57))
phi_17 <- phi_17 %>%
  mutate(Team = "PHI") %>%
  mutate(Year = 2017)

Once again, we’ll hide the other dataframe calculations for berevity.

Now that that has been done, let’s merge the summaries into one final dataframe. We also add in some of the more interesting averages that we couldn’t do with summary because of weighting.

sum_df <- Reduce(function(x, y) merge(x, y, all=TRUE), list(phi_08, phi_09, phi_10, phi_11, phi_12, phi_13, phi_14, phi_15, phi_16, phi_17, atl_08, atl_09, atl_10, atl_11, atl_12, atl_13, atl_14, atl_15, atl_16, atl_17, chc_08, chc_09, chc_10, chc_11, chc_12, chc_13, chc_14, chc_15, chc_15, chc_16, chc_17, hou_08, hou_09, hou_10, hou_11, hou_12, hou_13, hou_14, hou_15, hou_16, hou_17, nyy_08, nyy_09, nyy_10, nyy_11, nyy_12, nyy_13, nyy_14, nyy_15, nyy_16, nyy_17)) %>%
  rename(Age=Age.x) %>%
  rename(G=G.x) %>%
  rename(RS=R.x) %>%
  rename(H=H.x) %>%
  rename(HR=HR.x) %>%
  rename(BB=BB.x) %>%
  rename(Batting_SO=SO.x) %>%
  rename(Winning_pct=`W-L%`) %>%
  rename(RA=R.y) %>%
  rename(Doubles=`2B`) %>%
  rename(Triples=`3B`)

head(sum_df, 10)
##     Age   G   PA   AB  RS    H Doubles Triples  HR  BB Batting_SO    BA
## 1  25.5 162 6055 5447 629 1317     240      19 163 495       1442 0.242
## 2  25.9 162 6020 5457 610 1307     266      16 148 426       1535 0.240
## 3  26.3 162 6073 5459 729 1363     278      26 230 486       1392 0.250
## 4  26.5 162 6133 5535 690 1382     287      36 174 494       1417 0.250
## 5  26.5 162 6204 5545 724 1367     291      29 198 554       1452 0.247
## 6  26.6 162 6014 5407 583 1276     238      28 146 463       1365 0.236
## 7  26.7 162 6102 5508 614 1315     270      31 157 442       1477 0.239
## 8  26.7 162 6200 5491 689 1341     272      30 171 567       1518 0.244
## 9  26.8 162 6064 5468 573 1316     240      22 123 472       1369 0.241
## 10 26.8 162 6133 5441 688 1354     247      21 181 542       1384 0.249
##      OBP   SLG   OPS  W   L Winning_pct  ERA  RA  ER  WHIP Team Year
## 1  0.309 0.383 0.692 70  92       0.432 4.11 723 657 1.335  HOU 2014
## 2  0.299 0.375 0.674 51 111       0.315 4.79 848 766 1.490  HOU 2013
## 3  0.315 0.437 0.752 86  76       0.531 3.57 618 572 1.201  HOU 2015
## 4  0.315 0.409 0.723 66  96       0.407 4.55 782 729 1.387  PHI 2017
## 5  0.319 0.417 0.735 76  86       0.469 4.09 729 654 1.385  HOU 2016
## 6  0.302 0.371 0.673 55 107       0.340 4.56 794 721 1.428  HOU 2012
## 7  0.300 0.385 0.684 73  89       0.451 3.91 707 636 1.300  CHC 2014
## 8  0.321 0.398 0.719 97  65       0.599 3.36 608 546 1.152  CHC 2015
## 9  0.305 0.360 0.665 79  83       0.488 3.38 597 547 1.265  ATL 2014
## 10 0.321 0.402 0.723 96  66       0.593 3.18 548 512 1.196  ATL 2013

Could we have gotten to this point in an easier fashion? Sure. We could have pulled the summary statistics for each team across all years from baseball-references website, but now we have an idea as to what can go into these kinds of projects.

Data Analysis

Now that we have all this data that we’ve reduced down to, what does it mean? Well, as it turns out, there was some data loss in all that transformation, but we’re going to roll with it, as there is still enough data to make a point. It could mean a whole lot of things, and one of the easiest ways is to plot it. R makes plotting fairly straightforward with ggplot, so let’s get down to it and see how age and winning percentage align.

ggplot(sum_df, aes(x=Year, y=Winning_pct, colour=Team)) + geom_point() + geom_line()  +
  scale_color_manual(values=c("darkblue","blue","orange","black","red"))

ggplot(sum_df, aes(x=Year, y=Age, colour=Team)) + geom_point() + geom_line()  +
  scale_color_manual(values=c("darkblue","blue","orange","black","red"))

ggplot(sum_df, aes(x=Age, y=Winning_pct, colour=Team)) + geom_point() +
  scale_color_manual(values=c("darkblue","blue","orange","black","red"))

Now, we can get a little fancy, creating a 3d graph using plotly

plot_ly(sum_df, x = ~Year, y = ~Age, z = ~Winning_pct, color = ~Team, colors = c("darkblue","blue","orange","black","red")) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'Year'),
                     yaxis = list(title = 'Age'),
                     zaxis = list(title = 'Winning Percentage')))

The Age vs Winning Percentage Graph tells a lot, if you know some of the numbers behind the data. There are four main spikes in winning percentage, one at roughly 27, one at 28.7, one at 30.7, and one at 31.7. This seems to disprove our original idea that young teams are the way to go. However, consider the first two spikes. The tend is generally linear, so the first spike is rather strange. It’s where teams are first starting to get that young talent in the majors. With a few years experience, and some added veterens, that average age will go up, but so will the wins.

One other thing we can discuss about this data is the central tendency, spread, skew, and outliers. Central tendency is the data averages to. For example, most of the ages in our original datasets averaged to roughly 28 years old. The spread is the range that this data is between. For example, in 2018 Ichiro Suzuki is an impressive 44 years old. Spread also relates to where the data lies, most players tend to be in the 27 to 30 range. Speaking of Ronald Acuna and Ichiro Suzuki, both are examples of outliers. Outliers are data points that are significantly outside the expected range. Almost all players won’t make it to their age 40 season, let alone their 44th. Skew is whether that data tends to lean to one side of the average or the other.

The best way to see these statistical trends is through boxplots. Below, we show boxplots for each team and their average age in each year.

ggplot(sum_df,aes(x=Team, y=Age,group=Team)) +
    geom_boxplot()  + ylab("Age") + xlab("Team")

Here we notice some interesting things about each team. Houston has the largest spread, which makes sense. Houston, over the past two or three years, has been going out to get veteran players to help complete their team. Meanwhile, the Yankees have the lowest spread, and the highest central tendency. More than the other teams here, they have the money to go out and get top players, who are typically older. The Phillies, who have the second highest spread, are also significantly skewed. Most of their players have been in the league a long time, so most of the players are pretty old (as far as player standards go). Then, as they rebuilt, they went out and got veteran players who weren’t top tier, but might have a breakout season, keeping that average age up. In more recent years, the Phillies have called up their young players, increasing the spread.

Hypothesis Testing

Pythagorean Win Loss is one of the more interesting statistics that sabermatricians have come up with. Given the amount of runs scored for and against a team, it will compute what their expected winning percentage will be. In this example, we’ll do some hypothesis testing to see how valid Pythagorean Win Loss is against our dataset.

First, we have to add the expected winning percentage to our dataset. Pythagorean Win Loss is calculated as \[\text{win}\% = \frac{RS^2}{RS^2+RA^2}\] Depending on the sport, those exponents can change, tweaking the winning percentage to fit correctly.

sum_df <- mutate(sum_df, pyth=(RS^2)/((RS^2) + (RA^2)))

we can model winning percentage as a Bernoulli distribution where each game will give us one of two results: a win or a loss, with the odds being the winning percentage. So that means that \(\hat{p}\) is the winning percentage, which we can use to calculate the varience. From there, we can calculate the odds of the Pythagorean Win-Loss being outside of our winning percentage model. Normally, if looking for statistical improvement, you would want your predicted odds to be outside the realm of possible values for the original probability, but in our case, we want the odds to always be in that realm of possible values. If the odds are less than 5%, then we can reject the null hypothesis and say that the Pythagorean Win-Loss model is invalid.

hyp_df <- sum_df %>%
  select(Team,Year,G,W,L,Winning_pct,pyth) %>%
  mutate(winning_var = Winning_pct*(1-Winning_pct)/G) %>%
  mutate(p = 1-pnorm(pyth,Winning_pct,winning_var^.5)) %>%
  mutate(valid = ifelse(p > 0.95 | p < 0.05,FALSE,TRUE))

As we can see, that there is statisticially little chance of being outside the realm of possible values for winning percentage is very little. Only two values had winning percentages significantly outside their Pythagorean Win-Loss expected winning percentage. Both the 2016 Phillies and Yankees beat their expected Pythagorean Win-Loss. Thus, we can say with statistical certainty that Pythagorean Win-Loss is an accurate metric.

Machine Learning

Machine Learning is a hot topic in computer science. Machine learning uses statistics to help a computer improve its performance by itself. For example, Elon Musk’s OpenAI company developed a computer that beat professional eSports players. Using “trial by error” techniques, the computer was able to teach itself how to play a simplified version of Dota 2, a game with a notoriously large learning curve. It completely shut down beloved pro player Dendi.

In our example, we’ll cut out 2016-2017 to be our test year and use the rest to train a random forest to predict if a team has a winning percentage. A random forest is a collection of decision trees that it uses to predict the outcome. Each decision tree takes the input and, like a binary search tree, sends that input further down the line until it reaches the final node, which will give the prediction and it’s likelyhood. For our example, we will use all the possible statistics except for year and team. Note that the function randomForest comes from the randomForest package.

#First, add conditional if the team had a winning record
final_df <- mutate(sum_df, winning_record=ifelse(Winning_pct>0.5,"YES","NO"))
final_df <- mutate(final_df, winning_record=factor(winning_record, levels=c("YES", "NO")))
final_df <- select(final_df, -c(Team,W,L,Winning_pct))

#then, we strip out the 2017 year
test_df <- filter(final_df,Year==2017|Year==2016)
final_df <- filter(final_df,Year!=2017|Year!=2016)

#now we create the random forest
rf <- randomForest(winning_record~., data = final_df %>% select(-Year))

#now we see how well our prediction did
test_predictions <- predict(rf, newdata=test_df %>% select(-Year))

#and print out our results
table(pred=test_predictions, observed=test_df$winning_record)
##      observed
## pred  YES NO
##   YES   3  0
##   NO    0  6

Looks like our random forest predictor is spot on. Every time it predicted a winning record, the team got a winning record, which is pretty good.

Glossary of Terms

General Terms

Sabermetrics - The study of baseball statistics to evaluate how well teams and players do

Farm System - A nickname for the collection of a franchise’s minor league teams. This is where young players develop their craft against other young players, and some veteran players.

The Show - Also called the Bigs, its another name for Major League Baseball

WAR - Wins Above Replacement, an arbitrary calculation to compare how good a team is compared to another.

Dataframe - An object/matrix that stores the data we are interested in. Each column contains a particular attribute for each row, or entry.

CSV - Comma Separated Values, a plaintext file where each value is separated by a comma and each entry separated by a newline character. It is an easy way to store data.

Definition of Statistics

G - Number of Games played

PA - Plate Appearances, number of times a player has come to bat

AB - At-Bats, number of at bats a player has had. It’s different from plate appearances as walks don’t count towards at-bats

RS - Runs Scored, number of times a team has scored

H - Hits, number of times a player has reach at least first base with a ball put in play

2B - Doubles, number of times a player has reached second safely on a ball hit in play

3B - Triples, number of times a player has reached third safely on a ball hit in play

HR - Homeruns, number of times a player has rounded the bases by either hitting the ball over the outfield fence in fair territory, or by rounding all four bases on a single ball hit in play

BB - Batter Based, number of times a batter has walked. In hitting, its how many times the batter has walked; in pitching its how many times a pitcher has walked an opposing batter

SO - Strike Outs, also known as a K. In hitting, its the number of times a batter has struck out and in pitching its how many strikes a pitcher has collected

BA - Batting Average, also AVG, its the average number of times a player has reached at least first safely. Walks, errors, and fielder’s choices don’t count

OBP - On-Base Percentage, the average number of times a player has reached first either by a hit or by a walk

SLG - Slugging Percentage, weighted average each type of hit. Each one is weighed as such: singles by one, doubles by two, triples by three, and homeruns by four.

OPS - On-base Plus Slugging, pretty self explanatory

W - Wins, for a pitcher, its the number of times their team has taken the lead while they were pitching and not given the lead back up. For a team, its the total number of wins

L - Losses, like wins, for a pitcher, its the number of times they have given up the lead and their team has not gotten it back. For a team, its their total number of losses.

Winning Percentage - Pretty self explanatory, its the number of wins over the total number of games played

ERA - Earned Run Average, its the average number of runs a pitcher gives up in 9 innings of work

RA - Runs allowed

ER - Earned Runs Allowed, this subtracts runs given up because of errors

WHIP - Walks and Hits per Inning Pitched, the number of walks and hits a pitcher gives up on average per inning of work